 gene tree


PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders

Xie, Tianyu, Richman, Harry, Gao, Jiansi, Matsen, Frederick A. IV, Zhang, Cheng

arXiv.org Machine Learning

Learning informative representations of phylogenetic tree structures is essential for analyzing evolutionary relationships. Classical distance-based methods have been widely used to project phylogenetic trees into Euclidean space, but they are often sensitive to the choice of distance metric and may lack sufficient resolution. In this paper, we introduce phylogenetic variational autoencoders (PhyloVAEs), an unsupervised learning framework designed for representation learning and generative modeling of tree topologies. Leveraging an efficient encoding mechanism inspired by autoregressive tree topology generation, we develop a deep latent-variable generative model that facilitates fast, parallelized topology generation. PhyloVAE combines this generative model with a collaborative inference model based on learnable topological features, allowing for high-resolution representations of phylogenetic tree samples. Extensive experiments demonstrate PhyloVAE's robust representation learning capabilities and fast generation of phylogenetic tree topologies.
Phylogenetic trees are the foundational structure for describing the evolutionary processes among individuals or groups of biological entities. Reconstructing these trees based on collected biological sequences (e.g., DNA, RNA, protein) from observed species, also known as phylogenetic inference (Felsenstein, 2004), is an essential discipline of computational biology (Fitch, 1971; Felsenstein, 1981; Yang & Rannala, 1997; Ronquist et al., 2012). Large collections of trees obtained from these approaches (e.g., posterior samples from MCMC runs (Ronquist et al., 2012)), however, are often difficult to summarize or visualize due to the discrete and non-Euclidean nature of the tree topology space. The classical approach to visualizing and analyzing distributions of phylogenetic trees is to calculate pairwise distances between the trees and project them into a plane using multidimensional scaling (MDS) (Amenta & Klingner, 2002; Hillis et al., 2005; Jombart et al., 2017). However, these approaches have the shortcoming that one cannot map an arbitrary point in the visualization back to a tree, and therefore they do not form an actual visualization of the relevant tree space.
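The classical distance-plus-MDS pipeline mentioned above can be sketched in a few lines. This is a minimal illustration, not the method of any cited paper: it assumes each unrooted tree is given as a set of its nontrivial splits (here, the smaller side of each bipartition stored as a `frozenset`), uses the Robinson-Foulds distance as one possible choice of metric, and applies classical (Torgerson) MDS. The function names and the three toy 5-taxon trees are hypothetical.

```python
import numpy as np

def rf_distance(splits_a, splits_b):
    """Robinson-Foulds distance: number of splits found in exactly one tree."""
    return len(splits_a ^ splits_b)

def classical_mds(dist, dim=2):
    """Classical (Torgerson) MDS of a symmetric distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering      # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(gram)             # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:dim]            # indices of the largest ones
    scale = np.sqrt(np.clip(vals[top], 0.0, None))
    return vecs[:, top] * scale

# Toy input (assumption): three unrooted trees on taxa A..E, each encoded
# by the size-2 side of its two nontrivial splits.
trees = [
    {frozenset("AB"), frozenset("DE")},   # ((A,B),C,(D,E))
    {frozenset("AB"), frozenset("CD")},   # ((A,B),E,(C,D))
    {frozenset("AC"), frozenset("DE")},   # ((A,C),B,(D,E))
]

D = np.array([[rf_distance(s, t) for t in trees] for s in trees], dtype=float)
coords = classical_mds(D, dim=2)          # one 2-D point per tree
```

The embedding maps trees to points, but nothing in it maps a point back to a tree, which is the limitation the passage describes.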


Response to Comment on "Ancient origins of allosteric activation in a Ser-Thr kinase"

Science

Park et al. question one out of seven findings from Hadzipasic et al.: whether TPX2 allosterically regulates the oldest Aurora. We had already addressed the two concerns raised (sparse sequence sampling and not forcing the gene tree to the species tree) before publication. Moreover, we believe their ancestral sequence reconstruction would be consistent with a nonallosteric common ancestor, and we show large sequence differences caused by species tree–enforced gene trees. The key findings in Hadzipasic et al. (1) are that (i) autophosphorylation is the ancient allosteric regulation for Aurora kinases; (ii) a gradual increase in allosteric activation took place during holozoan evolution; (iii) an allosteric network in Aurora exists that, when mutated, alters allosteric activity; (iv) allosteric activation by TPX2 is entirely encoded in the kinase; (v) the interface between Aurora and TPX2 is co-conserved; (vi) evolution of specificity in signaling happens on binding affinity; and (vii) the oldest ancestral Aurora is not allosterically activated by TPX2. Notably, even though the ASR calculations differ, we believe the outcome is consistent with, rather than contradicting, the finding. The two concerns raised are (i) the small number of modern sequences used in the ASR calculations and (ii) the mismatch between the gene tree and the species tree.


Tropical Support Vector Machine and its Applications to Phylogenomics

Tang, Xiaoxian, Wang, Houjie, Yoshida, Ruriko

arXiv.org Machine Learning

Most data in genome-wide phylogenetic analysis (phylogenomics) is essentially multidimensional, posing a major challenge to human comprehension and computational analysis. Also, we cannot directly apply statistical learning models in data science to a set of phylogenetic trees since the space of phylogenetic trees is not Euclidean. In fact, the space of phylogenetic trees is a tropical Grassmannian in terms of max-plus algebra. Therefore, to classify multi-locus data sets for phylogenetic analysis, we propose tropical Support Vector Machines (SVMs) over the space of phylogenetic trees. Like classical SVMs, a tropical SVM is a discriminative classifier defined by the tropical hyperplane which maximizes the minimum tropical distance from data points to itself in order to separate these data points into open sectors. We show that we can formulate hard margin tropical SVMs and soft margin tropical SVMs as linear programming problems. In addition, we show the necessary and sufficient conditions for each data point to be separated and an explicit formula for the optimal solution for the feasible linear programming problem. Based on our theorems, we develop novel methods to compute tropical SVMs and computational experiments show our methods work well. We end this paper with open problems.
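To make the geometric ingredients of this abstract concrete, here is a small sketch under stated assumptions. It uses the standard tropical metric on the tropical projective torus, d_tr(u, v) = max_i(u_i - v_i) - min_i(u_i - v_i), and the max-plus convention in which a tropical hyperplane with normal vector `omega` consists of the points where max_i(omega_i + x_i) is attained at least twice. Under that convention, a point lies in the open sector indexed by the unique maximizing coordinate, and its tropical distance to the hyperplane is the gap between the largest and second-largest values of omega + x. The function names are hypothetical, and this is not the paper's linear-programming formulation of the SVM itself.

```python
import numpy as np

def tropical_distance(u, v):
    """Tropical metric: max minus min of the coordinate-wise difference."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(diff.max() - diff.min())

def sector_and_margin(x, omega):
    """Open sector of x for the max-plus hyperplane with normal omega,
    and the tropical distance from x to that hyperplane.

    The sector is the index where omega + x is largest; the distance is
    the gap between the largest and second-largest entries of omega + x.
    A gap of 0.0 means x lies on the hyperplane, not in an open sector.
    """
    s = np.asarray(x, dtype=float) + np.asarray(omega, dtype=float)
    order = np.argsort(s)
    return int(order[-1]), float(s[order[-1]] - s[order[-2]])

# Toy points (assumption): vectors in R^3 viewed modulo adding a constant
# to every coordinate.
d = tropical_distance([0, 1, 3], [0, 0, 0])
d_shifted = tropical_distance([1, 2, 4], [0, 0, 0])   # same point, shifted
sector, margin = sector_and_margin([0, 1, 3], [0, 0, 0])
```

Shifting every coordinate of a point by the same constant leaves the distance unchanged, which is why vectors are treated as equivalence classes in this setting. A classifier in this spirit would choose `omega` so that points of different classes fall into different sectors with as large a minimum margin as possible.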


Data Requirement for Phylogenetic Inference from Multiple Loci: A New Distance Method

Dasarathy, Gautam, Nowak, Robert, Roch, Sebastien

arXiv.org Machine Learning

We consider the problem of estimating the evolutionary history of a set of species (phylogeny or species tree) from several genes. It is known that the evolutionary history of individual genes (gene trees) might be topologically distinct from each other and from the underlying species tree, possibly confounding phylogenetic analysis. A further complication in practice is that one has to estimate gene trees from molecular sequences of finite length. We provide the first full data-requirement analysis of a species tree reconstruction method that takes into account estimation errors at the gene level. Under that criterion, we also devise a novel reconstruction algorithm that provably improves over all previous methods in a regime of interest.